First
candy_file <- "candy-data.csv"
candy <- read.csv(candy_file, row.names=1)
head(candy)
## chocolate fruity caramel peanutyalmondy nougat crispedricewafer
## 100 Grand 1 0 1 0 0 1
## 3 Musketeers 1 0 0 0 1 0
## One dime 0 0 0 0 0 0
## One quarter 0 0 0 0 0 0
## Air Heads 0 1 0 0 0 0
## Almond Joy 1 0 0 1 0 0
## hard bar pluribus sugarpercent pricepercent winpercent
## 100 Grand 0 1 0 0.732 0.860 66.97173
## 3 Musketeers 0 1 0 0.604 0.511 67.60294
## One dime 0 0 0 0.011 0.116 32.26109
## One quarter 0 0 0 0.011 0.511 46.11650
## Air Heads 0 0 0 0.906 0.511 52.34146
## Almond Joy 0 1 0 0.465 0.767 50.34755
dim(candy)
## [1] 85 12
There are 85 different candy types in the dataset.
dim(candy)
sum(candy$fruity,na.rm=TRUE)
## [1] 38
There are 38 fruity candy types in the database.
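Because the type columns are 0/1 indicators, summing one of them counts the candies of that type. The following is a minimal sketch on a toy subset: the six rows printed by head(candy) above, with only two columns kept. The name `toy` is illustrative and not part of the original analysis.

```r
# Toy subset copied from the head(candy) output above (not the full dataset).
toy <- data.frame(
  chocolate = c(1, 1, 0, 0, 0, 1),
  fruity    = c(0, 0, 0, 0, 1, 0),
  row.names = c("100 Grand", "3 Musketeers", "One dime",
                "One quarter", "Air Heads", "Almond Joy")
)

# Summing a 0/1 column counts the rows flagged with 1.
n_fruity <- sum(toy$fruity, na.rm = TRUE)

# table() reports the counts of both levels (0 and 1) at once.
fruity_counts <- table(toy$fruity)

# colSums() counts every indicator column in a single call.
type_counts <- colSums(toy)
```

On the full data frame, the same calls would be applied to candy instead of toy.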
candy["Twix", ]$winpercent
## [1] 81.64291
candy["Sour Patch Kids", ]$winpercent
## [1] 59.864
My favorite candy in the dataset is Sour Patch Kids and it has a winpercent value of 59.864%.
candy["Kit Kat",]$winpercent
## [1] 76.7686
Kit Kat has a win percent of 76.7686%.
candy["Tootsie Roll Snack Bars",]$winpercent
## [1] 49.6535
Tootsie Roll Snack Bars has a win percent of 49.6535%.
library("skimr")
skim(candy)
| Name | candy |
| Number of rows | 85 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| numeric | 12 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| chocolate | 0 | 1 | 0.44 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
| fruity | 0 | 1 | 0.45 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
| caramel | 0 | 1 | 0.16 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| peanutyalmondy | 0 | 1 | 0.16 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| nougat | 0 | 1 | 0.08 | 0.28 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| crispedricewafer | 0 | 1 | 0.08 | 0.28 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hard | 0 | 1 | 0.18 | 0.38 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| bar | 0 | 1 | 0.25 | 0.43 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| pluribus | 0 | 1 | 0.52 | 0.50 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
| sugarpercent | 0 | 1 | 0.48 | 0.28 | 0.01 | 0.22 | 0.47 | 0.73 | 0.99 | ▇▇▇▇▆ |
| pricepercent | 0 | 1 | 0.47 | 0.29 | 0.01 | 0.26 | 0.47 | 0.65 | 0.98 | ▇▇▇▇▆ |
| winpercent | 0 | 1 | 50.32 | 14.71 | 22.45 | 39.14 | 47.83 | 59.86 | 84.18 | ▃▇▆▅▂ |
The p100 column, because compared with the other percentile columns most of its values are close to 1, while the other columns mostly contain values of 0.00.
The 0/1 values in the chocolate column indicate whether or not the candy contains chocolate.
hist(candy$winpercent)
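No, the distribution is not symmetric. The values are concentrated toward the lower end, with a longer tail on the right (the mean, 50.32, is above the median, 47.83).
The center of the distribution seems to be slightly below 50%.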
choc_per <- candy$winpercent[as.logical(candy$chocolate)]
choc_per
## [1] 66.97173 67.60294 50.34755 56.91455 38.97504 55.37545 62.28448 56.49050
## [9] 59.23612 57.21925 76.76860 71.46505 66.57458 55.06407 73.09956 60.80070
## [17] 64.35334 47.82975 54.52645 70.73564 66.47068 69.48379 81.86626 84.18029
## [25] 73.43499 72.88790 65.71629 34.72200 37.88719 76.67378 59.52925 48.98265
## [33] 43.06890 45.73675 49.65350 81.64291 49.52411
mean(choc_per)
## [1] 60.92153
fruit_per <- candy$winpercent[as.logical(candy$fruity)]
fruit_per
## [1] 52.34146 34.51768 36.01763 24.52499 42.27208 39.46056 43.08892 39.18550
## [9] 46.78335 57.11974 51.41243 42.17877 28.12744 41.38956 39.14106 52.91139
## [17] 46.41172 55.35405 22.44534 39.44680 41.26551 37.34852 35.29076 42.84914
## [25] 63.08514 55.10370 45.99583 59.86400 52.82595 67.03763 34.57899 27.30386
## [33] 54.86111 48.98265 47.17323 45.46628 39.01190 44.37552
mean(fruit_per)
## [1] 44.11974
On average, chocolate candy is ranked higher than fruity candy (mean winpercent of 60.92 versus 44.12).
t.test(choc_per,fruit_per)
##
## Welch Two Sample t-test
##
## data: choc_per and fruit_per
## t = 6.2582, df = 68.882, p-value = 2.871e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 11.44563 22.15795
## sample estimates:
## mean of x mean of y
## 60.92153 44.11974
The difference is statistically significant: the p-value is 2.9e-08, far below 0.05, the conventional threshold for statistical significance.
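The numbers in the printed test result can also be read programmatically from the object that t.test() returns. The sketch below rebuilds choc_per and fruit_per from the values printed above so that it runs on its own. The 0.05 threshold is the conventional choice and is an assumption here, not a fixed rule.

```r
# Win percentages copied from the printed choc_per and fruit_per vectors above.
choc_per <- c(
  66.97173, 67.60294, 50.34755, 56.91455, 38.97504, 55.37545, 62.28448, 56.49050,
  59.23612, 57.21925, 76.76860, 71.46505, 66.57458, 55.06407, 73.09956, 60.80070,
  64.35334, 47.82975, 54.52645, 70.73564, 66.47068, 69.48379, 81.86626, 84.18029,
  73.43499, 72.88790, 65.71629, 34.72200, 37.88719, 76.67378, 59.52925, 48.98265,
  43.06890, 45.73675, 49.65350, 81.64291, 49.52411
)
fruit_per <- c(
  52.34146, 34.51768, 36.01763, 24.52499, 42.27208, 39.46056, 43.08892, 39.18550,
  46.78335, 57.11974, 51.41243, 42.17877, 28.12744, 41.38956, 39.14106, 52.91139,
  46.41172, 55.35405, 22.44534, 39.44680, 41.26551, 37.34852, 35.29076, 42.84914,
  63.08514, 55.10370, 45.99583, 59.86400, 52.82595, 67.03763, 34.57899, 27.30386,
  54.86111, 48.98265, 47.17323, 45.46628, 39.01190, 44.37552
)

# With two numeric vectors, t.test() runs a Welch two-sample test by default.
res <- t.test(choc_per, fruit_per)

p_value   <- res$p.value                                # two-sided p-value
mean_diff <- unname(res$estimate[1] - res$estimate[2])  # chocolate minus fruity
ci        <- res$conf.int                               # 95% CI for the difference

# alpha = 0.05 is an assumed, conventional significance level.
significant <- p_value < 0.05
```

Reading the confidence interval alongside the p-value shows not only that the difference is unlikely to be zero but also roughly how large it is.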
head(candy[order(candy$winpercent),], n=15)
## chocolate fruity caramel peanutyalmondy nougat
## Nik L Nip 0 1 0 0 0
## Boston Baked Beans 0 0 0 1 0
## Chiclets 0 1 0 0 0
## Super Bubble 0 1 0 0 0
## Jawbusters 0 1 0 0 0
## Root Beer Barrels 0 0 0 0 0
## Sugar Daddy 0 0 1 0 0
## One dime 0 0 0 0 0
## Sugar Babies 0 0 1 0 0
## Haribo Happy Cola 0 0 0 0 0
## Caramel Apple Pops 0 1 1 0 0
## Strawberry bon bons 0 1 0 0 0
## Sixlets 1 0 0 0 0
## Ring pop 0 1 0 0 0
## Chewey Lemonhead Fruit Mix 0 1 0 0 0
## crispedricewafer hard bar pluribus sugarpercent
## Nik L Nip 0 0 0 1 0.197
## Boston Baked Beans 0 0 0 1 0.313
## Chiclets 0 0 0 1 0.046
## Super Bubble 0 0 0 0 0.162
## Jawbusters 0 1 0 1 0.093
## Root Beer Barrels 0 1 0 1 0.732
## Sugar Daddy 0 0 0 0 0.418
## One dime 0 0 0 0 0.011
## Sugar Babies 0 0 0 1 0.965
## Haribo Happy Cola 0 0 0 1 0.465
## Caramel Apple Pops 0 0 0 0 0.604
## Strawberry bon bons 0 1 0 1 0.569
## Sixlets 0 0 0 1 0.220
## Ring pop 0 1 0 0 0.732
## Chewey Lemonhead Fruit Mix 0 0 0 1 0.732
## pricepercent winpercent
## Nik L Nip 0.976 22.44534
## Boston Baked Beans 0.511 23.41782
## Chiclets 0.325 24.52499
## Super Bubble 0.116 27.30386
## Jawbusters 0.511 28.12744
## Root Beer Barrels 0.069 29.70369
## Sugar Daddy 0.325 32.23100
## One dime 0.116 32.26109
## Sugar Babies 0.767 33.43755
## Haribo Happy Cola 0.465 34.15896
## Caramel Apple Pops 0.325 34.51768
## Strawberry bon bons 0.058 34.57899
## Sixlets 0.081 34.72200
## Ring pop 0.965 35.29076
## Chewey Lemonhead Fruit Mix 0.511 36.01763
The five least liked candy types in the set are Nik L Nip, Boston Baked Beans, Chiclets, Super Bubble, and Jawbusters.
tail(candy[order(candy$winpercent),], n=15)
## chocolate fruity caramel peanutyalmondy nougat
## M&M's 1 0 0 0 0
## 100 Grand 1 0 1 0 0
## Starburst 0 1 0 0 0
## 3 Musketeers 1 0 0 0 1
## Peanut M&Ms 1 0 0 1 0
## Nestle Butterfinger 1 0 0 1 0
## Peanut butter M&M's 1 0 0 1 0
## Reese's stuffed with pieces 1 0 0 1 0
## Milky Way 1 0 1 0 1
## Reese's pieces 1 0 0 1 0
## Snickers 1 0 1 1 1
## Kit Kat 1 0 0 0 0
## Twix 1 0 1 0 0
## Reese's Miniatures 1 0 0 1 0
## Reese's Peanut Butter cup 1 0 0 1 0
## crispedricewafer hard bar pluribus sugarpercent
## M&M's 0 0 0 1 0.825
## 100 Grand 1 0 1 0 0.732
## Starburst 0 0 0 1 0.151
## 3 Musketeers 0 0 1 0 0.604
## Peanut M&Ms 0 0 0 1 0.593
## Nestle Butterfinger 0 0 1 0 0.604
## Peanut butter M&M's 0 0 0 1 0.825
## Reese's stuffed with pieces 0 0 0 0 0.988
## Milky Way 0 0 1 0 0.604
## Reese's pieces 0 0 0 1 0.406
## Snickers 0 0 1 0 0.546
## Kit Kat 1 0 1 0 0.313
## Twix 1 0 1 0 0.546
## Reese's Miniatures 0 0 0 0 0.034
## Reese's Peanut Butter cup 0 0 0 0 0.720
## pricepercent winpercent
## M&M's 0.651 66.57458
## 100 Grand 0.860 66.97173
## Starburst 0.220 67.03763
## 3 Musketeers 0.511 67.60294
## Peanut M&Ms 0.651 69.48379
## Nestle Butterfinger 0.767 70.73564
## Peanut butter M&M's 0.651 71.46505
## Reese's stuffed with pieces 0.651 72.88790
## Milky Way 0.651 73.09956
## Reese's pieces 0.651 73.43499
## Snickers 0.651 76.67378
## Kit Kat 0.511 76.76860
## Twix 0.906 81.64291
## Reese's Miniatures 0.279 81.86626
## Reese's Peanut Butter cup 0.651 84.18029
The top five all-time favorite candy types are Snickers, Kit Kat, Twix, Reese’s Miniatures, and Reese’s Peanut Butter cup.
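Instead of reading the favorites off the end of an ascending sort, the ranking can be made explicit by sorting in decreasing order and taking the first rows. This is a small sketch on a toy subset whose values are copied from the tail() output above. The names `toy`, `ranked`, and `top5` are illustrative.

```r
# Toy subset: values copied from the tail() output above, deliberately unsorted.
toy <- data.frame(
  pricepercent = c(0.511, 0.651, 0.651, 0.906, 0.279, 0.651, 0.651),
  winpercent   = c(76.76860, 73.09956, 84.18029, 81.64291,
                   81.86626, 76.67378, 73.43499),
  row.names = c("Kit Kat", "Milky Way", "Reese's Peanut Butter cup", "Twix",
                "Reese's Miniatures", "Snickers", "Reese's pieces")
)

# order(..., decreasing = TRUE) places the highest winpercent first.
ranked <- toy[order(toy$winpercent, decreasing = TRUE), ]

# The row names of the first five rows are the five favorites, best first.
top5 <- rownames(head(ranked, n = 5))
```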
library(ggplot2)
ggplot(candy) +
aes(winpercent, rownames(candy)) +
geom_col()
ggplot(candy) +
aes(winpercent, reorder(rownames(candy),winpercent)) +
geom_col()
# Color each candy by type; later assignments override earlier ones
my_cols <- rep("black", nrow(candy))
my_cols[as.logical(candy$chocolate)] <- "chocolate"
my_cols[as.logical(candy$bar)] <- "brown"
my_cols[as.logical(candy$fruity)] <- "pink"
ggplot(candy) +
aes(winpercent, reorder(rownames(candy),winpercent)) +
geom_col(fill=my_cols)
The worst-ranked chocolate candy is Sixlets.
The best-ranked fruity candy is Starburst.
library("ggrepel")
# How about a plot of price vs win
ggplot(candy) +
aes(winpercent, pricepercent, label=rownames(candy)) +
geom_point(col=my_cols) +
geom_text_repel(col=my_cols, size=3.3, max.overlaps = 5)
## Warning: ggrepel: 50 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Reese's Miniatures, since it has a winpercent of over 80% with a pricepercent of only a little over 25%.
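The same reasoning (high winpercent, low pricepercent) can be checked without the plot by filtering to popular candies and sorting them by price. The sketch below uses a toy subset with values copied from the tables above. The 80% cutoff `min_win` is an assumed threshold chosen to mirror the answer, not something defined by the dataset.

```r
# Toy subset: pricepercent and winpercent copied from the tail() output above.
toy <- data.frame(
  pricepercent = c(0.651, 0.279, 0.906, 0.511, 0.220),
  winpercent   = c(84.18029, 81.86626, 81.64291, 76.76860, 67.03763),
  row.names = c("Reese's Peanut Butter cup", "Reese's Miniatures", "Twix",
                "Kit Kat", "Starburst")
)

# Assumed cutoff for what counts as a popular candy.
min_win <- 80

# Keep only the popular candies, then sort them from cheapest to most expensive.
popular    <- toy[toy$winpercent > min_win, ]
best_value <- popular[order(popular$pricepercent), ]

# The first row is the cheapest candy among the popular ones.
best_name <- rownames(best_value)[1]
```

A different cutoff, or a different way of combining the two columns, could give a different answer, so the result depends on how "best value" is defined.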
ord <- order(candy$pricepercent, decreasing = TRUE)
head( candy[ord,c(11,12)], n=5 )
## pricepercent winpercent
## Nik L Nip 0.976 22.44534
## Nestle Smarties 0.976 37.88719
## Ring pop 0.965 35.29076
## Hershey's Krackel 0.918 62.28448
## Hershey's Milk Chocolate 0.918 56.49050
The five most expensive candies are Nik L Nip, Nestle Smarties, Ring pop, Hershey's Krackel, and Hershey's Milk Chocolate. The least popular of these is Nik L Nip, with a winpercent of 22.45%.
# Make a lollipop chart of pricepercent
ggplot(candy) +
aes(pricepercent, reorder(rownames(candy), pricepercent)) +
geom_segment(aes(yend = reorder(rownames(candy), pricepercent),
xend = 0), col="gray40") +
geom_point()
library(corrplot)
## corrplot 0.92 loaded
cij <- cor(candy)
corrplot(cij)
The original variables picked up strongly by PC1 in the positive direction are fruity, hard, and pluribus.
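The PCA code itself does not appear in this section, so the following is only a sketch of how PC1 loadings could be inspected, assuming base R's prcomp() with scaling. It runs on a nine-row, six-column toy subset copied from the tables above, so its loadings will not match those of the full dataset. Note that the sign of a principal component is arbitrary: only the relative signs of the loadings are meaningful.

```r
# Toy subset: rows and values copied from the head()/tail() outputs above.
toy <- data.frame(
  chocolate  = c(0, 0, 0, 0, 0, 1, 1, 1, 1),
  fruity     = c(1, 1, 0, 1, 1, 0, 0, 0, 0),
  hard       = c(0, 1, 1, 1, 0, 0, 0, 0, 0),
  bar        = c(0, 0, 0, 0, 0, 1, 1, 1, 0),
  pluribus   = c(1, 1, 1, 1, 1, 0, 0, 0, 0),
  winpercent = c(22.44534, 28.12744, 29.70369, 34.57899, 67.03763,
                 67.60294, 76.76860, 81.64291, 84.18029),
  row.names = c("Nik L Nip", "Jawbusters", "Root Beer Barrels",
                "Strawberry bon bons", "Starburst", "3 Musketeers",
                "Kit Kat", "Twix", "Reese's Peanut Butter cup")
)

# scale. = TRUE standardizes the columns so winpercent (0-100) does not dominate.
pca <- prcomp(toy, scale. = TRUE)

# Loadings of the original variables on the first principal component.
pc1 <- pca$rotation[, "PC1"]

# Sort the loadings to see which variables pull in each direction.
pc1_sorted <- sort(pc1)
```

Variables whose loadings share a sign move together along PC1, while variables with opposite signs are contrasted by it.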